ProbMinHash – A Class of Locality-Sensitive Hash Algorithms for the (Probability) Jaccard Similarity
Related resources
Online Generation of Locality Sensitive Hash Signatures
Motivated by the recent interest in streaming algorithms for processing large text collections, we revisit the work of Ravichandran et al. (2005) on using the Locality Sensitive Hash (LSH) method of Charikar (2002) to enable fast, approximate comparisons of vector cosine similarity. For the common case of feature updates being additive over a data stream, we show that LSH signatures can be main...
Randomized Algorithms and NLP: Using Locality Sensitive Hash Functions for High Speed Noun Clustering
In this paper, we explore the power of randomized algorithms to address the challenge of working with very large amounts of data. We apply these algorithms to generate noun similarity lists from 70 million pages. We reduce the running time from quadratic to practically linear in the number of elements to be computed.
Similarity-Based Resource Retrieval in Multi-agent Systems by Using Locality-Sensitive Hash Functions
In this paper we address the problem of retrieving similar resources which are distributed over a multi-agent system (MAS). In distributed environments identification of resources is realized by using cryptographic hash functions like SHA-1. The issue with these functions in connection with similarity search is that they distribute their hash values uniformly over the codomain. Therefore such I...
S2JSD-LSH: A Locality-Sensitive Hashing Schema for Probability Distributions
To compare the similarity of probability distributions, information-theoretically motivated metrics such as Kullback-Leibler divergence (KL) and Jensen-Shannon divergence (JSD) are often more reasonable than vector metrics such as Euclidean and angular distance. However, existing locality-sensitive hashing (LSH) algorithms cannot support the information-theoretically motivated metric...
Journal
Journal title: IEEE Transactions on Knowledge and Data Engineering
Year: 2020
ISSN: 1041-4347,1558-2191,2326-3865
DOI: 10.1109/tkde.2020.3021176